LGANet: Local and global attention are both you need for action recognition

Authors

Abstract

Due to the redundancy in spatiotemporal neighborhoods and the global dependency between video frames, action recognition remains a challenging task. Prior works have mainly been driven by 3D convolutional neural networks (CNNs) or 2D CNNs with well-designed modules for temporal information. However, convolution-based methods lack the capability to capture global dependencies because of their limited receptive field. Alternatively, the transformer has been proposed to build long-range dependencies among frame patches. Nevertheless, most transformer-based methods incur significant computational costs because self-attention is calculated among all tokens. Based on these observations, we propose an efficient network which we dub LGANet. Unlike conventional transformers for action recognition, LGANet tackles both local and global dependencies by learning local token affinity in the shallow layers and global token dependency in the deep layers, respectively. Specifically, local attention is implemented in the shallow layers to reduce parameters and eliminate local redundancy. In the deep layers, spatial-wise and channel-wise self-attention are embedded to realize global interaction of high-level features. Moreover, several key designs are made in the multi-head self-attention (MSA) and feed-forward network (FFN). Extensive experiments are conducted on popular benchmarks such as Kinetics-400 and Something-Something V1&V2. Without any bells and whistles, LGANet achieves state-of-the-art performance. The code will be released soon.
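To make the design concrete, here is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: shallow blocks restrict attention to a local window of tokens, while deep blocks apply global spatial-wise attention followed by channel-wise attention. All module names, layer counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowAttention(nn.Module):
    """Shallow-layer block (assumed): tokens attend only within a small local
    window, which keeps the cost roughly linear in the number of tokens."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, C)
        B, N, C = x.shape
        pad = (-N) % self.window
        h = F.pad(self.norm(x), (0, 0, 0, pad))    # pad token axis to a multiple of window
        h = h.view(-1, self.window, C)             # (B * num_windows, window, C)
        h, _ = self.attn(h, h, h)                  # attention inside each window only
        h = h.reshape(B, N + pad, C)[:, :N]
        return x + h                               # residual connection

class GlobalSpatialChannelAttention(nn.Module):
    """Deep-layer block (assumed): global spatial-wise attention over all
    tokens, followed by channel-wise attention (channels attend to channels)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)

    def forward(self, x):                          # x: (B, N, C)
        h = self.norm1(x)
        s, _ = self.spatial(h, h, h)               # global attention among all tokens
        x = x + s
        q, k, v = self.qkv(self.norm2(x)).chunk(3, dim=-1)
        attn = torch.softmax(q.transpose(1, 2) @ k / q.shape[1] ** 0.5, dim=-1)  # (B, C, C)
        return x + v @ attn.transpose(1, 2)        # channel mixing

class LGANetSketch(nn.Module):
    """Toy stack: local attention in shallow stages, global attention in deep stages."""
    def __init__(self, dim=256, num_classes=400, shallow=4, deep=4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[LocalWindowAttention(dim) for _ in range(shallow)],
            *[GlobalSpatialChannelAttention(dim) for _ in range(deep)],
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                     # tokens: (B, N, C) patch embeddings
        return self.head(self.blocks(tokens).mean(dim=1))

# Example: 8 frames x 49 spatial patches = 392 tokens of width 256.
logits = LGANetSketch()(torch.randn(2, 8 * 49, 256))   # -> (2, 400)
```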

Similar articles

Global and Local Attention Processing in Depressed Mood

Background: Attention impairments are a hallmark feature of subclinical depression. The present study used the Navon task to compare the allocation of attention to local and global stimuli in depressed and nondepressed participants. Method: The primary sample included 186 female high school students from Shiraz city who were selected using cluster sampl...

Attention is All you Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. E...
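For reference, the core operation introduced in that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal illustrative sketch:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k); returns (batch, seq_len, d_k)
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarities
    return torch.softmax(scores, dim=-1) @ v        # weighted sum of values

q = k = v = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(q, k, v)          # -> (1, 5, 64)
```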

Need for global action for cancer control.

When the Millennium Development Goals (MDGs) [1] were being developed, priority was given to the problems of the poorest billion people in the world. In terms of health, this was translated into a set of targets of indicators in health that give visibility to maternal and child health, (under) nutrition, acquired immunodeficiency syndrome (AIDS), malaria, and tuberculosis, and a vague catch-all...

Joint Network based Attention for Action Recognition

By extracting spatial and temporal characteristics in one network, the two-stream ConvNets can achieve state-of-the-art performance in action recognition. However, such a framework typically suffers from processing spatial and temporal information separately in two standalone streams and struggles to capture the long-term temporal dependence of an action. More importantly, it is in...
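A minimal sketch of the standard two-stream setup referred to above (my own illustration, not this paper's joint network): a spatial stream scores RGB frames, a temporal stream scores stacked optical flow, and the class scores of the two standalone streams are fused by averaging.

```python
import torch
import torch.nn as nn

def make_stream(in_channels, num_classes=400):
    # A deliberately tiny stand-in for a 2D CNN backbone.
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

spatial_stream = make_stream(3)        # a single RGB frame
temporal_stream = make_stream(2 * 10)  # 10 stacked optical-flow fields (x, y)

rgb = torch.randn(2, 3, 224, 224)
flow = torch.randn(2, 20, 224, 224)
fused = (spatial_stream(rgb) + temporal_stream(flow)) / 2   # late fusion of class scores
```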

Action Recognition using Visual Attention

We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts in the fram...
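A minimal sketch of this kind of soft attention (under my assumptions, not the paper's exact model): at each time step an LSTM's hidden state scores the spatial locations of the frame's CNN feature map, and the LSTM consumes the attention-weighted feature before classification.

```python
import torch
import torch.nn as nn

class SoftAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=51):  # sizes are illustrative
        super().__init__()
        self.score = nn.Linear(hidden + feat_dim, 1)   # scores one spatial location
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):                 # feats: (B, T, K, D) per-frame CNN feature maps
        B, T, K, D = feats.shape
        h = feats.new_zeros(B, self.cell.hidden_size)
        c = torch.zeros_like(h)
        for t in range(T):
            f = feats[:, t]                                            # (B, K, D)
            s = self.score(torch.cat([h.unsqueeze(1).expand(-1, K, -1), f], dim=-1))
            alpha = torch.softmax(s, dim=1)                            # (B, K, 1) attention map
            glimpse = (alpha * f).sum(dim=1)                           # (B, D) attended feature
            h, c = self.cell(glimpse, (h, c))
        return self.head(h)                                            # classify after last frame

logits = SoftAttentionLSTM()(torch.randn(2, 8, 49, 512))   # -> (2, 51)
```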


Journal

Journal title: IET Image Processing

Year: 2023

ISSN: 1751-9659, 1751-9667

DOI: https://doi.org/10.1049/ipr2.12876